library(readr)
library(ggplot2)
library(dplyr)
library(glmnet)
theme_set(theme_minimal())
We will need two new packages today. You can install them with:
if (!require(xgboost))
{
install.packages("xgboost")
}
if (!require(randomForest))
{
install.packages("randomForest")
}
library(xgboost)
library(randomForest)
I think it is useful to run some simulations with trees to get a better grasp on exactly how they use the data to make predictions. Let’s make a simulated dataset with 10k points and a data matrix with just two columns. The two columns are sampled uniformly in \([0,1]\). The output variable y will be 1 when the first variable is larger than the second and 0 otherwise.
n <- 10000
X <- matrix(runif(n * 2), ncol=2)
y <- as.numeric(X[,1] > X[,2])
The code below fits a very simple random forest model. The model has only a single split (maxnodes = 2 gives a single split with two outputs), always considers both variables (so the only randomness is the sample of data used), and uses only one tree. This is basically a decision tree fit on a random subset of the training data. The plot shows the predictions using color and marks any misclassified data points with a small black dot. You can run this without making any changes:
model <- randomForest(X, factor(y), mtry = 2, maxnodes = 2, ntree = 1)
pred <- predict(model, newdata = X)
df <- tibble(y = y, x1 = X[,1], x2 = X[,2], pred = pred)
df %>%
ggplot(aes(x1, x2)) +
geom_point(aes(color = pred)) +
geom_point(color = "black", size=0.2, data = filter(df, pred != y)) +
geom_abline(slope = 1, intercept = 0)
Rerun the same block of code a few times. Notice that (1) the split tends to be around 0.5 and (2) sometimes a split on x1 is used and sometimes a split on x2 is used. Make sure you understand both of these behaviors.
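To see why the first split gravitates toward 0.5, we can grid-search the training error of a single split by hand. This is a small base-R sketch that recreates the same simulated setup; the threshold grid and seed are arbitrary choices, not part of the lab:

```r
# For a stump that predicts 1 whenever x1 > t, the expected error is
# t^2 - t + 1/2, which is minimized at t = 0.5. Check it empirically:
set.seed(1)
n <- 10000
X <- matrix(runif(n * 2), ncol = 2)
y <- as.numeric(X[, 1] > X[, 2])

ts  <- seq(0.05, 0.95, by = 0.05)
err <- sapply(ts, function(t) mean(as.numeric(X[, 1] > t) != y))
ts[which.min(err)]   # should land at (or very near) 0.5
```

The same argument applies symmetrically to a split on x2 (predicting 1 below the threshold), which is why the tree is indifferent between the two variables.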
Now, copy the code above into the block below. Change the maximum number of nodes to three and observe what the output looks like. Run it a couple of times to see the variation in each output. Can you deduce what the decision tree looks like each time?
model <- randomForest(X, factor(y), mtry = 2, maxnodes = 3, ntree = 1)
pred <- predict(model, newdata = X)
df <- tibble(y = y, x1 = X[,1], x2 = X[,2], pred = pred)
df %>%
ggplot(aes(x1, x2)) +
geom_point(aes(color = pred)) +
geom_point(color = "black", size=0.2, data = filter(df, pred != y)) +
geom_abline(slope = 1, intercept = 0)
Copy the code above once more. Change maxnodes to 5 and observe how the tree tries to estimate the line with slope one and intercept zero. Increase maxnodes again to 20 and take note of how well this model does compared to the simpler trees.
model <- randomForest(X, factor(y), mtry = 2, maxnodes = 20, ntree = 1)
pred <- predict(model, newdata = X)
df <- tibble(y = y, x1 = X[,1], x2 = X[,2], pred = pred)
df %>%
ggplot(aes(x1, x2)) +
geom_point(aes(color = pred)) +
geom_point(color = "black", size=0.2, data = filter(df, pred != y)) +
geom_abline(slope = 1, intercept = 0)
Copy the code one more time. Set maxnodes back to 2, but increase ntree to 3 and run the code a few times. You’ll see that sometimes it looks like a single split and sometimes it looks like two splits. What is going on?
model <- randomForest(X, factor(y), mtry = 2, maxnodes = 2, ntree = 3)
pred <- predict(model, newdata = X)
df <- tibble(y = y, x1 = X[,1], x2 = X[,2], pred = pred)
df %>%
ggplot(aes(x1, x2)) +
geom_point(aes(color = pred)) +
geom_point(color = "black", size=0.2, data = filter(df, pred != y)) +
geom_abline(slope = 1, intercept = 0)
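What is happening is majority voting: each of the three stumps is fit on its own bootstrap sample, and the forest predicts the class that at least two of the three stumps agree on. When the stumps happen to pick nearly the same split the vote collapses to (roughly) one split; when they disagree, the combined boundary shows an extra step. Here is a hand-rolled sketch of the same mechanism using only base R; the grid of candidate thresholds is an arbitrary choice:

```r
set.seed(42)
n <- 10000
X <- matrix(runif(n * 2), ncol = 2)
y <- as.numeric(X[, 1] > X[, 2])

# Fit a depth-one "stump" on a bootstrap sample: try every combination of
# variable, threshold, and direction; keep the split with the lowest error.
fit_stump <- function(X, y) {
  idx  <- sample(nrow(X), replace = TRUE)          # bootstrap sample
  best <- list(err = Inf)
  for (j in 1:2) for (t in seq(0.05, 0.95, by = 0.05)) for (dir in c(1, -1)) {
    pred <- as.numeric(dir * X[idx, j] > dir * t)  # dir = -1 flips the rule
    err  <- mean(pred != y[idx])
    if (err < best$err) best <- list(err = err, j = j, t = t, dir = dir)
  }
  best
}

stumps <- replicate(3, fit_stump(X, y), simplify = FALSE)
votes  <- Reduce(`+`, lapply(stumps, function(s)
  as.numeric(s$dir * X[, s$j] > s$dir * s$t)))
pred <- as.numeric(votes >= 2)                     # majority vote
mean(pred == y)                                    # accuracy of the voted prediction
```

The ntree = 3 forest is doing exactly this kind of voting, which is why the plotted decision boundary sometimes shows one split and sometimes two.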
The dataset for this lab is a collection of domestic flights within the U.S. during the year 2013. The full collection has quite a large number of rows (roughly 400k), but we will work with a sample of 6,000 flights and a limited number of columns. Some of the most important columns, however, are categorical with a large number of values.
flights <- read_csv("https://github.com/statsmaths/ml_data/blob/master/flights_small.csv?raw=true")
ocol <- ncol(flights)
flights
## # A tibble: 6,000 x 12
## obs_id train_id delayed month day weekday arr_hour dep_hour origin
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 id_00… train 1 3 27 2 14 12 LAS
## 2 id_00… valid 1 8 18 6 19 17 PHL
## 3 id_00… train 0 3 14 3 21 20 BOS
## 4 id_00… valid 1 8 9 4 12 11 BOS
## 5 id_00… test NA 11 25 7 22 21 ATL
## 6 id_00… train 0 8 7 2 16 14 LAX
## 7 id_00… valid 0 9 30 7 20 19 ATL
## 8 id_00… test NA 11 12 1 19 17 EWR
## 9 id_00… test NA 12 9 7 8 6 DTW
## 10 id_00… valid 1 10 6 6 17 16 SAN
## # … with 5,990 more rows, and 3 more variables: dest <chr>,
## # distance <dbl>, carrier <chr>
Below I want you to use five different approaches to predict whether a flight will be delayed. Each time, produce (1) the classification accuracy on the training and validation sets and (2), where applicable, a summary of which variables are most important in the prediction model. Generally, just use all of the variables, though you’ll have a bit more freedom to experiment with which terms to smooth and interact together in the GAM. I have refrained from specifying some of the exact details; play around with what seems to work best. Also, make sure you save the script frequently in case something crashes!
X <- model.matrix( ~ -1 + ., data = flights[,seq(4, ocol)])
y <- flights$delayed
X_train <- X[flights$train_id == "train",]
y_train <- y[flights$train_id == "train"]
model <- cv.glmnet(X_train, y_train, family="binomial")
pred <- as.numeric(predict(model, newx = X) > 0)
tapply(flights$delayed == pred, flights$train_id, mean)
## test train valid
## NA 0.5990 0.5895
It will also be useful to see what the most important variables in the model are:
B <- coef(model, s = model$lambda[5])
B[B[,1] != 0,,drop=FALSE]
## 3 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.34811120
## arr_hour 0.01012931
## dep_hour 0.01432700
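The lambda used above (the fifth value on the path) is an arbitrary choice; cv.glmnet also stores two cross-validated choices, "lambda.min" and the more heavily penalized "lambda.1se". A self-contained toy sketch of comparing coefficients at both (the simulated data here is illustrative, not the flights data):

```r
library(glmnet)

set.seed(1)
X_toy <- matrix(rnorm(200 * 10), ncol = 10)
y_toy <- rbinom(200, 1, plogis(X_toy[, 1] - X_toy[, 2]))

m <- cv.glmnet(X_toy, y_toy, family = "binomial")

# Coefficients at the CV optimum and at the one-standard-error rule:
cbind(lambda.min = as.numeric(coef(m, s = "lambda.min")),
      lambda.1se = as.numeric(coef(m, s = "lambda.1se")))
```

The lambda.1se column is typically sparser; with the flights model you can swap s = model$lambda[5] for either named value.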
X <- model.matrix( ~ -1 + ., data = flights[,seq(4, ocol)])
y <- flights$delayed
X_train <- X[flights$train_id == "train",]
y_train <- y[flights$train_id == "train"]
X_valid <- X[flights$train_id == "valid",]
y_valid <- y[flights$train_id == "valid"]
for (k in c(3, 10, 25, 50, 100))
{
pred_valid <- FNN::knn(X_train, X_valid, y_train, k=k)
print(mean(pred_valid == y_valid))
}
## [1] 0.542
## [1] 0.5625
## [1] 0.548
## [1] 0.5445
## [1] 0.539
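One caution with k-nearest neighbors on this matrix: the columns are on wildly different scales (dep_hour runs 0–23 while distance runs into the thousands of miles), so Euclidean distance is dominated by the large-scale columns. A tiny base-R illustration with made-up flight values:

```r
# Three hypothetical flights described by (departure hour, distance in miles).
pts <- rbind(a = c(hour = 9,  miles = 500),
             b = c(hour = 9,  miles = 520),   # same hour, similar route
             c = c(hour = 22, miles = 505))   # late-night flight
dist(pts)                              # raw: c looks closer to a than b does
dist(sweep(pts, 2, c(24, 3000), "/"))  # rescaled by column range: b is closest
```

Rescaling the columns of X (for example with scale()) before calling FNN::knn often improves the validation numbers noticeably.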
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-23. For overview type 'help("mgcv-package")'.
model <- gam(delayed ~ origin + dest + s(dep_hour, arr_hour, distance),
data = flights,
family = binomial(),
subset = train_id == "train")
pred <- as.numeric(predict(model, newdata = flights) > 0)
tapply(flights$delayed == pred, flights$train_id, mean)
## test train valid
## NA 0.6305 0.5900
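To see what the GAM actually fit, summary() reports the parametric coefficients and the effective degrees of freedom of each smooth term. A toy sketch mirroring the call above (simulated data, not the flights):

```r
library(mgcv)

set.seed(1)
toy <- data.frame(x1 = runif(500), x2 = runif(500))
toy$y <- rbinom(500, 1, plogis(3 * (toy$x1 - toy$x2)))

m <- gam(y ~ s(x1, x2), data = toy, family = binomial())
summary(m)   # edf of s(x1, x2) and approximate significance of each term
```

With the flights fit, summary(model) works the same way and is a reasonable stand-in for variable importance here.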
flights2 <- flights
flights2$origin <- factor(flights2$origin)
flights2$dest <- factor(flights2$dest)
flights2$carrier <- factor(flights2$carrier)
model <- randomForest(factor(delayed) ~ month + weekday + arr_hour + dep_hour + origin + dest + distance + carrier,
data = flights2,
subset = train_id == "train",
ntree = 100, maxnodes = 15)
pred <- as.numeric(predict(model, newdata = flights2)) - 1L
tapply(flights$delayed == pred, flights$train_id, mean)
## test train valid
## NA 0.677 0.607
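The random forest block above reports accuracy but not which variables matter. randomForest records the mean decrease in Gini impurity for each variable; here is a minimal sketch on the built-in iris data (swap in the flights model for the real answer):

```r
library(randomForest)

m <- randomForest(Species ~ ., data = iris, ntree = 100)
importance(m)   # mean decrease in Gini impurity per variable
varImpPlot(m)   # same information as a dot plot
```

For the model above, importance(model) and varImpPlot(model) give the flight-level answer.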
X <- model.matrix( ~ -1 + ., data = flights[,seq(4, ncol(flights))])
y <- flights$delayed
X_train <- X[flights$train_id == "train",]
y_train <- y[flights$train_id == "train"]
X_valid <- X[flights$train_id == "valid",]
y_valid <- y[flights$train_id == "valid"]
data_train <- xgb.DMatrix(data = X_train, label = y_train)
data_valid <- xgb.DMatrix(data = X_valid, label = y_valid)
watchlist <- list(train=data_train, valid=data_valid)
model <- xgb.train(data = data_train,
max_depth = 5, eta = 0.003, nthread = 2,
nrounds = 100,
objective = "binary:logistic",
watchlist = watchlist,
verbose=1, print_every_n=10, early_stopping_rounds = 100)
## [1] train-error:0.371000 valid-error:0.430500
## Multiple eval metrics are present. Will use valid_error for early stopping.
## Will train until valid_error hasn't improved in 100 rounds.
##
## [11] train-error:0.356000 valid-error:0.443500
## [21] train-error:0.357000 valid-error:0.443500
## [31] train-error:0.356500 valid-error:0.446000
## [41] train-error:0.355000 valid-error:0.447500
## [51] train-error:0.355000 valid-error:0.446500
## [61] train-error:0.354500 valid-error:0.446500
## [71] train-error:0.354500 valid-error:0.446500
## [81] train-error:0.358000 valid-error:0.447500
## [91] train-error:0.357500 valid-error:0.445000
## [100] train-error:0.360500 valid-error:0.434500
importance_matrix <- xgb.importance(model = model)
importance_matrix
## Feature Gain Cover Frequency
## 1: dep_hour 0.3169062814 2.222412e-01 0.1271911945
## 2: distance 0.1539465537 1.149334e-01 0.1630656339
## 3: day 0.0601610199 5.722906e-02 0.0835711374
## 4: destDEN 0.0597558511 1.008294e-01 0.0668569099
## 5: arr_hour 0.0564554148 8.910177e-02 0.0835711374
## 6: month 0.0446229794 5.413137e-02 0.0766408479
## 7: weekday 0.0394430501 1.847148e-02 0.0986547085
## 8: destCLT 0.0379445393 5.168204e-02 0.0338361190
## 9: destEWR 0.0317800262 6.194631e-02 0.0326131268
## 10: carrierAA 0.0309772697 4.546447e-03 0.0297594782
## 11: destMDW 0.0246821284 3.607814e-02 0.0260905014
## 12: originJFK 0.0186330577 1.712325e-02 0.0146759071
## 13: carrierXE 0.0180173076 2.710544e-02 0.0183448838
## 14: originLAX 0.0173382444 3.954304e-02 0.0150835711
## 15: carrierOH 0.0164748289 2.819657e-02 0.0187525479
## 16: destMSP 0.0164098304 2.823096e-02 0.0146759071
## 17: carrierUS 0.0152987935 1.000503e-02 0.0240521810
## 18: destORD 0.0099107499 1.755060e-02 0.0110069303
## 19: originSAN 0.0083819792 2.445547e-03 0.0118222585
## 20: destJFK 0.0067003938 7.079628e-03 0.0126375866
## 21: destSFO 0.0039798995 1.469421e-03 0.0085609458
## 22: originCLT 0.0029076657 1.580220e-03 0.0048919690
## 23: originLGA 0.0027392225 4.492030e-03 0.0048919690
## 24: originSFO 0.0026395247 1.791410e-03 0.0069302894
## 25: originATL 0.0013354590 7.010750e-04 0.0081532817
## 26: carrierUA 0.0012084412 8.488053e-04 0.0016306563
## 27: destSEA 0.0008175935 5.648849e-04 0.0008153282
## 28: originCVG 0.0005318946 8.144905e-05 0.0012229923
## Feature Gain Cover Frequency
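Note from the watchlist output above that the validation error stops improving early. When early_stopping_rounds is set, xgboost records the best round in the fitted object as best_iteration, which you can use to restrict prediction to the best round. A self-contained toy sketch (simulated data; the learning rate and depth are arbitrary choices):

```r
library(xgboost)

set.seed(1)
X_toy <- matrix(runif(2000), ncol = 2)
y_toy <- as.numeric(X_toy[, 1] > X_toy[, 2])

dtrain <- xgb.DMatrix(X_toy[1:800, ],    label = y_toy[1:800])
dvalid <- xgb.DMatrix(X_toy[801:1000, ], label = y_toy[801:1000])

m <- xgb.train(data = dtrain, nrounds = 200, eta = 0.1, max_depth = 3,
               objective = "binary:logistic",
               watchlist = list(valid = dvalid),
               early_stopping_rounds = 20, verbose = 0)
m$best_iteration   # round with the lowest validation error
```

Given the flat validation curve in the flights model, raising eta or nrounds and letting early stopping pick the round is worth trying.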
Now, as with last time, build the best possible model you can to predict the output. Feel free to make use of the validation set to fit the final model (this is certainly not a requirement, just a suggestion).
Finally, once you have your predictions saved as a variable called pred, run the following code to produce your results:
submit <- select(flights, obs_id, train_id)
submit <- mutate(submit, pred = pred)
write_csv(submit, "class13_submit.csv")
Now, upload the csv file to GitHub and you are done.